DCR: Replay-Debugging for the Datacenter
Authors
Abstract
We’ve built a tool for debugging non-deterministic failures in production datacenter applications. Our system, called DCR, is the first to efficiently record and replay large-scale, distributed, and data-intensive systems such as HDFS/GFS, HBase/Bigtable, and Hadoop/MapReduce. The enabling idea behind DCR is that debugging doesn’t require a precise replica of the original datacenter run. Instead, it suffices to produce some run that exhibits the original control-plane behavior. This report details the design and implementation of DCR and provides preliminary results.
Similar resources
Replay Debugging for the Datacenter
Replay Debugging for the Datacenter, by Gautam Deepak Altekar. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Ion Stoica, Chair. Debugging large-scale, data-intensive, distributed applications running in a datacenter (“datacenter applications”) is complex and time-consuming. The key obstacle is non-deterministic failures: hard-to-reproduce program misbehaviors...
An Empirical Study of the Control and Data Planes (or Control Plane Determinism is Key for Replay Debugging Datacenter Applications)
Replay debugging systems enable the reproduction and debugging of non-deterministic failures in production application runs. However, no existing replay system is suitable for datacenter applications like Cassandra, Hadoop, and Hypertable. For these large scale, distributed, and data intensive programs, existing methods either incur excessive production overheads or don’t scale to multi-node, t...
Focus Replay Debugging Effort on the Control Plane
Replay debugging systems enable the reproduction and debugging of non-deterministic failures in production application runs. However, no existing replay system is suitable for datacenter applications like Cassandra, Hadoop, and Hypertable. On these large scale, distributed, and data intensive programs, existing replay methods either incur excessive production recording overheads or are unable t...
Simplifying Datacenter Network Debugging with PathDump
Datacenter networks continue to grow complex due to larger scales, higher speeds and higher link utilization. Existing tools to manage and debug these networks are even more complex, requiring in-network techniques like collecting per-packet per-switch logs, dynamic switch rule updates, periodically collecting data plane snapshots, packet mirroring, packet sampling, traffic replay, etc. This pa...
Debug Determinism: The Sweet Spot for Replay-Based Debugging
Deterministic replay tools offer a compelling approach to debugging hard-to-reproduce bugs. Recent work on relaxed-deterministic replay techniques shows that replay debugging with low in-production overhead is possible. However, despite considerable progress, a replay-debugging system that offers not only low in-production runtime overhead but also high debugging utility remains out of reach. T...